IRREGULAR WAVEFRONTS IN DATA-DRIVEN,
DATA-DEPENDENT COMPUTATIONS


Rami G. Melhem


ABSTRACT 

Data driven networks may be more efficient than clocked networks for computations
which require data dependent local cycles. However, the performance of such networks may
not be easily predicted because of possible delays in computations due to internal data
conflicts. In this paper, a technique, which is based on the irregular propagation of
computation fronts, is suggested for the study of the behavior of networks with data
dependent operations.


INTRODUCTION 


Both systolic (Kung 1982) and data driven networks (Kung 1984) share the advantages
of efficient communications and fast specialized cells which repeat the execution of specific
local cycles. However, systolic networks are less flexible than data driven networks in the
sense that if the execution time of the local cycles is not constant, then the period of the global
synchronization clock must be large enough to accommodate the slowest local cycle.

In this paper, we consider networks in which the execution time of local cycles depends
on the input data. Typically, this may occur if the local cycles contain branching statements.
Although data driven networks are self-synchronized, and hence local cycles are allowed to
have different execution times, it is not obvious that the execution of the entire network may
benefit from the fast execution of some local cycles. More specifically, internal data conflict
may force a "potentially short" local cycle to wait extensively for its input.

The study of speed and efficiency of data driven networks with data dependent
operations is extremely hard due to the asynchronous nature of the networks. Hence, we suggest a
technique for the estimation of a lower bound on the performance of such networks. Namely,
we introduce a simpler, hypothetical type of computation, which we call pseudo-systolic. It
is obtained by forcing some synchronization on the data driven network such that its execution
alternates between communication and processing phases. Clearly, the additional
synchronization may only slow down execution, and hence the analysis of pseudo-systolic computations
provides upper bounds on the execution time of the corresponding data driven computations.


The  state  of  a  pseudo-systolic  computation  at  any  given  time  may  be  represented  by  a 
computation  front.  However,  unlike  the  wave  front  concepts  described  by  Weiser  and  Davis 
(1981)  and  Kung  (1984),  the  progress  of  the  pseudo-systolic  computation  causes  an  irregular 
propagation  of  the  computation  front.  This  irregularity  reflects  the  differences  in  the  execution 
time  of  the  local  cycles. 


Computation fronts may be systematically constructed by the application of some
conditions which are necessary for the consistency of data flow in pseudo-systolic networks. The
constructed fronts may then be applied to the estimation of the execution time of the
corresponding computation. This methodology is first illustrated in Section 1 with an example
of a linear array with simple cells. Then it is applied in Section 2 to 2-dimensional arrays and
in Section 3 to networks with complex cells.


DATA  DRIVEN  NETWORKS  WITH  TRIVIAL/NONTRIVIAL  LOCAL  CYCLES 


A data driven network is defined here as a set of cells, each having a certain number of
input/output ports, and a set of unidirectional communication links, each connecting an output
port of some cell to an input port of another cell.

Each communication link directed from a cell q to a cell k is regarded as a queue Q
capable of holding a certain number of data items. It is natural to assume that Q is empty at the
beginning of the operation of the network. However, it is sometimes useful to initialize Q
such that it contains some data items. To be more specific, let QX and QY represent
two links directed from cell q to cell k, and assume that, initially, QY is empty while QX
contains one data item (a zero, for example). Now if, during operation, cell q writes
$x_1, x_2, \ldots$ and $y_1, y_2, \ldots$ on QX and QY, respectively, then the sequences of items
read by cell k from the same queues will be $0, x_1, x_2, \ldots$ and $y_1, y_2, \ldots$,
respectively. This skewing effect is equivalent to the one obtained in systolic (clocked)
networks by the insertion of a delay element on the x-stream communication line.
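The skewing effect can be checked with a minimal sketch in which each link is a FIFO queue. The function name `read_streams` and the lock-step write/read schedule are assumptions made for illustration only; a real data driven network imposes no such schedule.

```python
from collections import deque


def read_streams(xs, ys, initial_x=0):
    """Return the (x, y) pairs seen by the reading cell k.

    QX is pre-loaded with one item, QY starts empty (hypothetical helper).
    """
    qx = deque([initial_x])  # QX initially holds a single data item
    qy = deque()             # QY is initially empty
    pairs = []
    for x, y in zip(xs, ys):
        # cell q writes one item on each link
        qx.append(x)
        qy.append(y)
        # cell k reads one item from each link
        pairs.append((qx.popleft(), qy.popleft()))
    return pairs


pairs = read_streams(["x1", "x2", "x3"], ["y1", "y2", "y3"])
# The x-stream is delayed by one position relative to the y-stream.
```

The reader receives `(0, y1)`, `(x1, y2)`, `(x2, y3)`, which is the one-item skew described above.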


An  Example 


Figure 1: Matrix/Vector Multiplication (w = 2)

The network shown in Figure 1 may be used for the multiplication of an $n \times n$ banded
matrix A by a vector x. For simplicity, we assume that the number w of lower diagonals of
A is equal to the number of upper diagonals, and hence the bandwidth of A is $W = 2w + 1$.

The network is composed of W cells, where each two consecutive cells k and k-1 are
connected by two links, which we call the x-link and the y-link. The queues on the y-links
are set to be initially empty and those on the x-links are set to contain a single data item,
namely a "zero". Using the naming convention shown in Figure 1 for the I/O ports, and using
the notation $O \leftarrow a$ and $\theta \leftarrow [I]$ to indicate that a is written on port O and that the value at
port I is read into $\theta$, respectively, we may describe the operation of each cell in the network
by the following algorithm:

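ALG1: Repeat Forever

1) wait until the queues at $I_1$, $I_2$ and $I_3$ are not empty

2) $\xi \leftarrow [I_1]$; $\eta \leftarrow [I_2]$; $\alpha \leftarrow [I_3]$

3) $\eta = \eta + \alpha * \xi$

4) wait until the queues at $O_1$ and $O_2$ are not full

5) $O_1 \leftarrow \xi$; $O_2 \leftarrow \eta$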
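One local cycle of ALG1 can be sketched as a function acting on FIFO queues. This is only an illustration: the function name `alg1_cycle` and the `capacity` parameter are assumptions, and the sketch checks the output queues before consuming any input, whereas ALG1 reads and computes first and waits for output space in step 4.

```python
from collections import deque


def alg1_cycle(i1, i2, i3, o1, o2, capacity):
    """Try to execute one ALG1 local cycle; return True if it ran.

    i1, i2, i3 are the input queues and o1, o2 the output queues
    (hypothetical names). The cycle is refused, instead of blocking,
    when an input is empty or an output is full.
    """
    if not (i1 and i2 and i3):                          # step 1
        return False
    if len(o1) >= capacity or len(o2) >= capacity:      # step 4
        return False
    xi, eta, alpha = i1.popleft(), i2.popleft(), i3.popleft()  # step 2
    eta = eta + alpha * xi                              # step 3
    o1.append(xi)                                       # step 5
    o2.append(eta)
    return True


i1, i2, i3 = deque([2.0]), deque([1.0]), deque([3.0])
o1, o2 = deque(), deque()
ran = alg1_cycle(i1, i2, i3, o1, o2, capacity=1)
```

After the call, `o1` holds the forwarded x value and `o2` holds the accumulated partial sum; a second call returns `False` because the inputs are exhausted.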

The five steps in ALG1 constitute a local cycle which is repeated, indefinitely, by each
cell in the network. We assume that the computation time (step 3) and the communication
time (steps 2 and 5) do not depend on the values of the data items being processed, and we
denote these times by $\tau_m$ and $\tau_c$, respectively. With this, the execution time of any local cycle
which is not delayed by a wait in steps 1 or 4 is given by $\tau_c + \tau_m$. It is clear, however, that the
time of a local cycle may be longer than $\tau_c + \tau_m$ if execution is delayed in steps 1 or 4.

DEFINITION: The "basic local cycle time" is the time needed to complete the execution of a
local cycle, excluding any delay caused by a wait for new input or a wait for the consumption
of old output.

The sequence of inputs to the network is shown in Figure 1 (... are set to zero). If
each input is made available to the network as soon as it is needed, then it is easy to verify
that the elements $y_i$, $i = 1, \ldots, n$ of the product vector $y = Ax$ are produced at the output port of cell
$-w$ at time $(2 + i + W)(\tau_c + \tau_m)$.

The assumption that the basic local cycle time is constant leads to a mode of operation in
which, after a few initial cycles, all the local cycles are automatically synchronized. The
progress of the computation in this case may be best represented by the propagation of a
computation front as shown in Figure 1.

For highly sparse matrices, the above network is clearly inefficient because most of the
elements received on ports $I_3$ of cells $-w, \ldots, w$ are zeroes, thus leading to trivial and
unnecessary computations in step 3 of ALG1. In this case, we may improve the performance of the
network, and at the same time reduce the amount of its communication with the outside
world, by supplying only the nonzero elements of A along with their positions. More
specifically, the input $a_{i,i+k}$ to port $I_3$ of cell k is omitted if $a_{i,i+k} = 0$, while if $a_{i,i+k} \ne 0$,
the value of the index i is supplied on an additional port, which we call $I_4$. The precise
operation of each cell may be described by the following algorithm:

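ALG2: $CT = -2$; $\alpha \leftarrow [I_3]$; $\iota \leftarrow [I_4]$   /* CT is a local counter */

Repeat Forever

1) wait until the queues at $I_1$ and $I_2$ are not empty

2) $\xi \leftarrow [I_1]$; $\eta \leftarrow [I_2]$; $CT = CT + 1$

3) IF ($\iota = CT$) THEN 3.1) $\eta = \eta + \alpha * \xi$

                         3.2) $\alpha \leftarrow [I_3]$; $\iota \leftarrow [I_4]$

4) wait until the queues on $O_1$ and $O_2$ are not full

5) $O_1 \leftarrow \xi$; $O_2 \leftarrow \eta$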
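The distinction between cycles that skip step 3 and cycles that execute it can be sketched as follows. The class name `Alg2Cell`, the `ct0` parameter for the initial counter value, and the use of `None` for an exhausted index stream are assumptions for illustration; the waits of steps 1 and 4 are left out so that only the data dependent branch is shown.

```python
from collections import deque


class Alg2Cell:
    """Sketch of the data dependent part of an ALG2 cell (hypothetical)."""

    def __init__(self, i3, i4, ct0):
        self.i3, self.i4 = i3, i4   # queues of nonzero values and their indices
        self.ct = ct0               # local counter CT
        self._load()                # initial read of (alpha, iota)

    def _load(self):
        # Step 3.2: fetch the next nonzero element and its index, if any.
        if self.i3 and self.i4:
            self.alpha, self.iota = self.i3.popleft(), self.i4.popleft()
        else:
            self.alpha, self.iota = 0.0, None

    def cycle(self, xi, eta):
        """Run steps 2, 3 and 5 on one (xi, eta) pair.

        Returns (xi, eta, skipped) where skipped is True if step 3 was skipped.
        """
        self.ct += 1                      # step 2
        skipped = self.iota != self.ct
        if not skipped:
            eta = eta + self.alpha * xi   # step 3.1
            self._load()                  # step 3.2
        return xi, eta, skipped


cell = Alg2Cell(deque([5.0]), deque([2]), ct0=0)
first = cell.cycle(1.0, 0.0)    # index 2 does not match CT = 1: step 3 skipped
second = cell.cycle(3.0, 1.0)   # index 2 matches CT = 2: multiply and add
```

Only the second cycle performs the floating point operation; the first merely forwards its inputs.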


The  input  to  the  modified  network  is  shown  in  Figure  2  for  a  specific  sparse  matrix  A  . 
Any  element  of  A  which  is  not  included  in  the  input  is  assumed  to  be  zero. 

Unlike ALG1, the basic local cycle time in ALG2 depends on the input data. More
specifically, it may assume one of two values depending on the result of the comparison in step
3. Let the two values be $\tau_0$ and $\tau_1$ for cycles which skip, and do not skip, step 3, respectively.

The architecture of each cell and the technology used to construct the network determine
$\tau_0$ and $\tau_1$. However, if we assume that the communication protocols are implemented in
hardware, and that the operations in step 3.1 are floating point operations, then it is reasonable
to assume that $\tau_0 \ll \tau_1$. We will call an execution of a cycle which skips steps 3.1 and 3.2 a
trivial execution of the cycle. Although trivial executions of local cycles reduce the average
basic local cycle time, the efficiency of the entire network is determined by the delay
introduced in steps 1 and 4 of local cycles. More specifically, a computational cell executing a
nontrivial cycle may hold data which is necessary for the execution of a trivial local cycle in a
neighboring cell. In other words, the execution time of the network is determined by internal
data conflict.

Given the asynchronous nature of the computation, it seems that simulation is the only
method for the precise estimation of its execution time. However, this requires the
specification of $\tau_0$ and $\tau_1$, which depend on architecture and technology. In the next section,
we introduce a method which isolates and measures the effect of internal data conflict on the
performance of data driven/data dependent networks.

Pseudo  Systolic  Synchronization  and  Irregular  Computation  Fronts 

In order to establish an upper bound on the execution time of networks with trivial/nontrivial
local cycles, we consider a hypothetical mode of operation which we call "pseudo
systolic". It is obtained by adding some global synchronization to the network such that
communication and computation take place in two alternating phases. Clearly, the delay
introduced by the additional synchronization may only slow down execution, and hence a study of
the pseudo systolic mode of operation may be viewed as a worst case analysis of the
asynchronous mode of operation.

More specifically, we assume that all the cells in the network are connected to a
"hypothetical" controller which senses the state of execution of each cell. The controller
forces successive executions of trivial local cycles of all the cells to take place in a
single phase which we call a communication phase. This phase is then followed by a
processing phase, in which only nontrivial executions of local cycles take place. A communication
phase followed by a processing phase is called a global cycle.

For example, a pseudo systolic version of the network of Figure 2 may be obtained by
replacing step 3 in ALG2 with the following:

3) IF ($\iota = CT$) THEN 3.1) wait for SYNC

                         3.2) $\eta = \eta + \alpha * \xi$

                         3.3) $\alpha \leftarrow [I_3]$; $\iota \leftarrow [I_4]$

where  SYNC  is  a  signal  sent  by  the  hypothetical  controller  at  the  end  of  a  communication 
phase.  More  specifically,  during  a  communication  phase,  data  is  moving  in  the  network  until 
each  cell  is  either  blocked  in  steps  1  or  4  due  to  data  conflict  or  blocked  in  step  3.1.  At  this 
point  the  controller  issues  SYNC,  and  all  the  cells  which  are  blocked  in  step  3.1  execute  3.2 
and  3.3  simultaneously,  while  the  other  cells  remain  idle.  This  is  a  processing  phase. 




Let $M_t$ be the subset of cells which are not idle during the processing phase of the t-th
global cycle. Clearly, only the cells in $M_t$ contribute to the advancement of the computation
during the t-th global cycle. Hence, the progress of the computations may be represented by
the propagation of a front which includes the data items being operated upon by cells in $M_t$.
More specifically, if $a(t,k) = a_{i,i+k}$, where $a_{i,i+k}$ is the element of A which is at cell k at the
beginning of the t-th processing phase, then the t-th computation front may be defined by

$$CF_t = \{ a(t,k) : k \in M_t \}$$

For example, we show in Figure 3 the computation fronts for the matrix/vector operation
given in Figure 2. The number of fronts, namely 7, indicates the number of global cycles.
Although, for small computations, it is possible to construct the fronts by tracing the
execution of the network, it is obvious that a more systematic construction method is needed for
large computations. This is discussed in the next subsection.

Consistency  of  Data  Flow  Conditions 


Figure  4.  Data  flow  in  a  linear  array 


Let $z_1, z_2, \ldots$ be a sequence of data items which are flowing through a linear array of cells
$c_1, c_2, \ldots$ as shown in Figure 4. Let also $d_z$ be the maximum capacity of each queue
corresponding to a communication link in the array. Clearly, at any particular instant,
the capacity of the queues should not be exceeded, and the order of items in the data stream
should be preserved. This may be formally stated as follows:

CONDITION (1): If $q > k$ and, at any instant, $z_i$ is at cell $c_k$ and $z_l$ is at cell $c_q$, then

$$d_z (q-k) \ge l - i > 0. \qquad (1)$$

CONDITION (2): If $z_l$ and $z_i$ arrive at cell $c_k$ at times $t$ and $\tau$, respectively, then

$$l < i \implies t < \tau. \qquad (2)$$
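Condition (1) can be tested mechanically on a snapshot of the array. The sketch below assumes a snapshot is given as a mapping from cell index to the index of the stream item currently at that cell; this representation and the name `satisfies_condition_1` are illustrative assumptions.

```python
def satisfies_condition_1(snapshot, capacity):
    """Check d_z (q - k) >= l - i > 0 for every pair of cells k < q.

    snapshot maps a cell index k to the index i of the item z_i at that cell;
    capacity plays the role of d_z.
    """
    cells = sorted(snapshot)
    for pos, k in enumerate(cells):
        for q in cells[pos + 1:]:
            i, l = snapshot[k], snapshot[q]
            if not (capacity * (q - k) >= l - i > 0):
                return False
    return True


ok = satisfies_condition_1({1: 1, 2: 3, 3: 4}, capacity=2)
too_far = satisfies_condition_1({1: 1, 2: 4}, capacity=2)   # 3 items apart, room for 2
reordered = satisfies_condition_1({1: 3, 2: 1}, capacity=2)  # stream order violated
```

The first snapshot is consistent, the second exceeds the queue capacity between adjacent cells, and the third breaks the ordering of the stream.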

The above two conditions, which are necessary and sufficient for the consistency of data
streams in linear arrays, may be used to derive the relations between the elements of
computation fronts in pseudo systolic computations. For example, consider the multiplication network
described in the last section, and let $a_{i,i+k}$ and $a_{l,l+q}$, with $q > k$, be two elements in a specific front $CF_t$.
This means that both cells k and q are active during the t-th global cycle, and hence $y_i$ and
$x_{i+k}$ are at ports $I_2$ and $I_1$ of cell k, respectively, and $y_l$ and $x_{l+q}$ are at ports $I_2$ and $I_1$ of
cell q, respectively. Assuming that the capacities of the x-links and y-links in the network are
$d_x$ and $d_y$, respectively, we may apply condition (1) to get

$$d_y (q-k) \ge (l-i) > 0 \quad \text{and} \quad d_x (q-k) \ge (l+q) - (i+k) > 0,$$

that is

$$\min\{d_y,\ d_x - 1\} \ge \frac{l-i}{q-k} > 0. \qquad (3)$$

Along the same line of thinking, condition (2) is translated to

$$(a_{i,i+k} \in CF_t \ \wedge\ a_{l,l+k} \in CF_\tau) \ \text{and}\ (i > l) \implies t > \tau. \qquad (4)$$

Computation fronts for any specific input matrix pattern may be constructed by
applying (3) and (4). More specifically, if each element $a_{i,i+k}$ of the input is associated with a
position $(i,k)$ as shown in Figure 3, then equation (3) bounds the slope of the line segment joining
any two elements in the same front. For example, for $d_y = 3$ and $d_x = 4$, we obtain

$$a_{i,i+k},\ a_{l,l+q} \in CF_t \implies 3 \ge \frac{l-i}{q-k} > 0. \qquad (5)$$
Also, (4) implies that computation fronts may not overlap. Hence, starting from the right
upper element in the graphical representation of the input (Figure 3), we may construct the
fronts recursively. More specifically, given $CF_1, \ldots, CF_{t-1}$, we construct $CF_t$ such that

1) it includes as many elements of A as allowed by (5), and

2) there are no nonzero elements of A between $CF_t$ and $CF_{t-1}$.

As the above example shows, conditions (1) and (2) may be used to design algorithms for
the automatic construction of the computation fronts, thus providing a method for the
determination of the number of global cycles needed to complete the computations.
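A building block for such an algorithm is a test of whether a candidate set of elements may share one front under the slope bound (3). The sketch below is not the construction procedure itself; the name `can_share_front` and the representation of each element $a_{i,i+k}$ by its position `(i, k)` are assumptions for illustration.

```python
def can_share_front(elements, d_x, d_y):
    """Return True if every pair of positions (i, k), (l, q) with q > k
    satisfies min(d_y, d_x - 1) >= (l - i) / (q - k) > 0.

    Two elements at the same cell k can never share a front, since a cell
    processes one element per global cycle.
    """
    bound = min(d_y, d_x - 1)
    pts = sorted(elements, key=lambda p: p[1])   # order by cell index k
    for a in range(len(pts)):
        for b in range(a + 1, len(pts)):
            (i, k), (l, q) = pts[a], pts[b]
            if q == k:
                return False
            # multiply through by (q - k) > 0 to avoid division
            if not (bound * (q - k) >= l - i > 0):
                return False
    return True


same_front = can_share_front([(1, -1), (2, 0), (5, 1)], d_x=4, d_y=3)
too_steep = can_share_front([(1, 0), (5, 1)], d_x=4, d_y=3)
```

With $d_y = 3$ and $d_x = 4$ the bound is 3, as in (5): the first set is admissible, while the second pair has slope 4 and must be split across two fronts.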

Finally, we would like to note that the structure of each local cycle does affect the
permissible slopes of the computation fronts. More specifically, condition (1) was derived
assuming that each cell in Figure 4 performs a cycle of the form [read z; compute; write z].
If, however, the value of z is not altered, then it is possible to write the local cycle in the form
[read z; write z; compute]. In this case, condition (1) becomes

$$d_z (q-k) \ge l - i \ge 0. \qquad (6)$$


The  Execution  Time  of  Pseudo-Systolic  Computations 


Let N be the number of global cycles needed to complete a specific pseudo systolic
computation. If $\tau_0 \ll \tau_1$, then the execution time of the computation is $T \approx N \tau_1$. However, if we
are not willing to neglect $\tau_0$, then T may be computed from

$$T = \sum_{t=1}^{N} (\Delta_t + \tau_1 - \tau_0) \qquad (7)$$

where $\Delta_t$ is the time for the communication phase in the t-th global cycle. An upper bound
may be imposed on $\Delta_t$, $t = 1, \ldots, N$, by studying the changes in the data profiles which take
place during the communication phases. This is illustrated in the remainder of this section by
means of the previous matrix/vector multiplication example.
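Equation (7) and its approximation for $\tau_0 \ll \tau_1$ can be evaluated directly once the communication phase times are known. The function names and the numerical values below are illustrative assumptions, not data from the paper.

```python
def execution_time(deltas, tau0, tau1):
    """Equation (7): sum over global cycles of (Delta_t + tau1 - tau0)."""
    return sum(delta + tau1 - tau0 for delta in deltas)


def execution_time_approx(n_cycles, tau1):
    """Approximation T ~ N * tau1, valid when tau0 is negligible."""
    return n_cycles * tau1


deltas = [2.0, 3.0]                      # assumed communication phase times
exact = execution_time(deltas, tau0=1.0, tau1=10.0)
approx = execution_time_approx(len(deltas), tau1=10.0)
```

Here the exact value exceeds the approximation by the accumulated communication time minus $N \tau_0$.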

First, we define for each global cycle t the x and y data profiles. Namely, the x-data
profile is a function $xP_t : [-w, w] \times \{0, 1, 2, \ldots\} \to \{1, \ldots, n\}$, defined such that $xP_t(k,u) = i$ if, at
the end of the t-th global cycle, $x_i$ is the u-th element in the x-link queue at cell k, and
$xP_t(k,u) = \bot$ (undefined) if the length of that queue is less than u. For simplicity, we let
$xP_t(k,0)$ be the last value of x read by cell k during the t-th global cycle. The y-data profile
is defined in a similar way.

Next, we show how to construct $xP_t$ from $CF_t$ (the construction of $yP_t$ is similar). For
each cell $k \in M_t$, we know that $xP_t(k,0) = i + k$, where $a_{i,i+k} \in CF_t$. Moreover, given any two
cells $k, q \in M_t$ such that 1) $a_{i,i+k}, a_{l,l+q} \in CF_t$ and 2) for $k < c < q$, c is not in $M_t$, we know
that the elements $x_{i+k}, \ldots, x_{l+q-1}$ should occupy consecutive locations on the x-link queues,
starting at cell k. In other words, the profile between cells k and q is obtained from:

IF (k is the largest cell in $M_t$) THEN last = n ELSE last = l + q - 1
$\beta = 0$; $u = 0$

FOR $\lambda = i + k, \ldots, \text{last}$ DO

    $xP_t(k + \beta, u) = \lambda$

    IF ($u < d_x$) THEN $u = u + 1$ ELSE $u = 0$; $\beta = \beta + 1$

FOR $c = \beta + 1, \ldots, q - 1$ DO $xP_t(c, 0) = \text{last}$

At time 0, we may assume that all the elements are stored in the buffer of cell w, that is

$$xP_0(k,q) = \begin{cases} q & \text{if } k = w \text{ and } q = 1, \ldots, n \\ \bot & \text{otherwise.} \end{cases}$$

We also define the pseudo inverse function $loc_t$ such that $loc_t(x_j) = k$ if $xP_t(k,u) = j$ for some
u. Now, during the t-th communication phase (denoted from now on by $CP_t$), the data
movement in the network causes a change in the x and y profiles from $xP_{t-1}$ and $yP_{t-1}$ to $xP_t$
and $yP_t$. If we assume that $d_x = d_y$, then it is easy to show that $xP_t(k,u) = yP_t(k,u) + k$.
Hence, changes in the two profiles occur simultaneously and the time $\Delta_t$ of $CP_t$ may be
calculated by considering only the change in one profile, say the x-profile.

Consider a specific cell k and let $xP_{t-1}(k,0) = i + k$ and $xP_t(k,0) = l + k$. Clearly,
during $CP_t$, cell k should execute $xP_t(k,0) - xP_{t-1}(k,0) = l - i$ trivial local cycles. On the other
hand, if $loc_{t-1}(x_{l+k}) = k'$, then $x_{l+k}$ should travel across $(k' - k)$ cells during $CP_t$ in order to
reach its position in $xP_t$. In other words, the communication activity in cell k during $CP_t$
requires a time

$$\Delta_t(k) \le (xP_t(k,0) - xP_{t-1}(k,0))\,\tau_0 + (k' - k)\,\tau_0.$$

But $CP_t$ terminates when the communication activities in all the cells terminate, that is

$$\Delta_t = \max\{\Delta_t(k) : -w \le k \le w\}.$$

This, when used in (7), provides an estimate for the execution time of the pseudo systolic
computation in terms of the parameters $\tau_0$ and $\tau_1$.
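The bound on one communication phase can be sketched as follows, taking the per-cell bound as the value of $\Delta_t(k)$. The function name and the three dictionaries (heads of the previous and current x-profiles, and the previous location $k'$ of the item that cell k ends up holding) are assumed representations chosen for illustration.

```python
def comm_phase_bound(prev_head, cur_head, prev_loc, tau0):
    """Upper bound on Delta_t as the maximum over cells of the bound on Delta_t(k).

    prev_head[k] = xP_{t-1}(k, 0), cur_head[k] = xP_t(k, 0), and
    prev_loc[k] = k', the cell holding that item before the phase started.
    """
    return max(
        ((cur_head[k] - prev_head[k]) + (prev_loc[k] - k)) * tau0
        for k in cur_head
    )


bound = comm_phase_bound(
    prev_head={0: 1, 1: 2},
    cur_head={0: 3, 1: 3},
    prev_loc={0: 2, 1: 2},
    tau0=1.0,
)
```

In this made-up instance cell 0 needs two trivial cycles plus two hops, so it determines the length of the phase.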


APPLICATION  TO  2-D  NETWORKS 


In this section we demonstrate how to construct the computation fronts for 2-D
computations with trivial/nontrivial local cycles. More specifically, we consider the multiplication
$C = AB$ of two $n \times n$ banded sparse matrices on a $(2w+1) \times (2w+1)$ array (Weiser and
Davis, 1981), where w is the half bandwidth of A and B. The array is shown in Figure 5
along with its input. For simplicity, we assume that all the elements in the bands of A and B
are supplied to the network, and that each cell is capable of recognizing and skipping trivial
operations. We also assume that the capacities of the queues on the internal communication
links are arbitrarily large and that each queue does initially contain one data item, namely a
zero. More precisely, the operation of each cell is described by
Figure 5. A matrix/matrix multiplication computation

ALG3: Repeat

1) wait until the queues on ports $I_1$, $I_2$, and $I_3$ are not empty

2) $\alpha \leftarrow [I_1]$; $\beta \leftarrow [I_2]$; $\gamma \leftarrow [I_3]$

3) $O_1 \leftarrow \alpha$; $O_2 \leftarrow \beta$

4) IF ($\alpha \ne 0$ and $\beta \ne 0$) THEN 4.1) wait for SYNC

                                      4.2) $\gamma = \gamma + \alpha * \beta$

5) $O_3 \leftarrow \gamma$

Note that in ALG3, we assumed pseudo systolic operation. Of course, the normal data
driven operation may be obtained by removing step 4.1 from the local cycle of each cell.

The multiplication $C = AB$ may be decomposed into $C = \sum_{r=-w}^{w} A B_r$, where each $B_r$
contains the elements in the r-th off diagonal of B. In the network of Figure 5, the operation
$C_r = A B_r$ is performed by the cells in row r of the 2-D array. Given that each row $(r,k)$,
$k = -w, \ldots, w$ of cells is a linear array, we may apply the same rules discussed earlier to the
construction of the fronts for that row. These fronts will be denoted by $CF_1^{(r)}, CF_2^{(r)}, \ldots$.

The t-th computation front of row r, $CF_t^{(r)}$, may be defined as either the set of elements
of A, or the set of elements of B, which are operated upon during the t-th global cycle. The
two sets are related and may be derived from each other. Here, we will choose the first
alternative, that is, we will use A to represent the propagation of the computation.

The condition which relates the elements in the same front may be derived as follows:
Let $a_{i,i+k}$ and $a_{l,l+q}$ be in $CF_t^{(r)}$. Then $b_{i+k,i+k+r}$ and $b_{l+q,l+q+r}$ should be at cells $(r,k)$ and
$(r,q)$, respectively, during the t-th global cycle. Noting that the values on the b data stream
are written in ALG3 as soon as they are read, we may apply (6) to get

$$\infty > \frac{l-i}{q-k} \ge -1 \qquad (8)$$

which determines the slope of the computation fronts. In Figure 6, we show the fronts for the
computation of Figure 5. Note that, for any given row r, some nonzero elements of A are not
included in any front (in Figure 6, these elements are masked by a circle). More specifically, if
$b_{j,j+r} = 0$ for some j, then by ALG3 the operation $a_{j-q,j} * b_{j,j+r}$ is skipped for $q = -w, \ldots, w$,
and hence the elements $a_{j-q,j}$ are excluded from any front for row r.
Figure 6. The computation fronts: (a) for row 2; (b) for row 1; (c) at global cycle 4


Unfortunately, the computations performed by the different rows in the array are not
independent. Namely, row r-1 receives data from row r. Loosely speaking, this implies that
$CF_t^{(r-1)}$ follows $CF_t^{(r)}$. Stated differently, during any global cycle t, the fronts
$CF_t^{(-w)}, \ldots, CF_t^{(w)}$ are ordered in a non overlapping way.




In order to be more specific, consider two consecutive rows r and r-1, and assume that
$a_{i,i+k} \in CF_t^{(r)}$ and $a_{l,l+q} \in CF_t^{(r-1)}$ with $k > q$. The application of condition (6) on both
the a-data stream and the result stream along the 45° links gives

$$l < i + (k - q - 1). \qquad (9)$$

This means that if $CF_t^{(r)}$ passes through position $(i,k)$, for some i and k, then $CF_t^{(r-1)}$ may
not pass by a position $(l,q)$ which violates (9). For fixed i and k, the boundary of (9) is a straight
line with slope 135° starting at $(i,k)$. Hence, $(l,q)$ should not cross that line. The piecewise
linear curve composed from such lines for all the elements in $CF_t^{(r)}$ is called the envelope of
$CF_t^{(r)}$, which we denote by $E_t^{(r)}$. The envelope $E_t^{(r)}$ should be used to separate $CF_t^{(r-1)}$ from
$CF_t^{(r)}$ during the construction of the former (see Figure 6).

In summary, after the construction of the computation fronts $CF_t^{(r)}$, $t = 1, 2, \ldots$ for row
r, the fronts for row r-1 may be constructed as follows.

1) Mask the elements of A which correspond to zeroes in the (r-1)-st diagonal of B.

2) Compute the envelopes $E_t^{(r)}$, $t = 1, 2, \ldots$ for the fronts in row r.

3) Construct the fronts in row r-1 recursively, such that, given $CF_1^{(r-1)}, \ldots, CF_{t-1}^{(r-1)}$,
the front $CF_t^{(r-1)}$ satisfies the following:

    1) it contains as many elements of A as allowed by (8),

    2) there are no nonzero elements of A between $CF_t^{(r-1)}$ and $CF_{t-1}^{(r-1)}$,

    3) $CF_t^{(r-1)}$ does not cross $E_t^{(r)}$.


COMPUTATIONS  WITH  COMPLEX  DATA  DEPENDENT  CYCLES 


The concept of irregular computation fronts may be applied to data dependent networks
even when the execution of each local cycle is more complex than the simple trivial/nontrivial
scheme discussed so far. More specifically, we will discuss the case in which the basic local
cycle time may be any multiple of a unit time $\tau_1$. In this case a pseudo systolic computation
may be assumed, such that successive snapshots of the computation fronts are obtained at $\tau_1$
time intervals.

For example, let each input item $a_{i,i+k}$ in Figure 1 be a sparse $m \times m$ submatrix and
each $y_i$ and $x_{i+k}$ be an m-dimensional subvector, and assume that step 3 of ALG1 is a "smart"
matrix/vector operation which skips trivial multiplications and additions. In other words, the
time consumed in the operation $y_i = y_i + a_{i,i+k}\, x_{i+k}$ in step 3 is equal to $p_{i,i+k}\,\tau_1$, where $p_{i,i+k}$
is the number of nonzero elements in $a_{i,i+k}$ and $\tau_1$ is the time of a multiply/add operation.

The consistency of data flow condition (1) may be applied to derive the same equation (5)
which governs the slopes of the computation fronts. However, a cell which operates on an
element $a_{i,i+k}$ terminates the corresponding local cycle in time $p_{i,i+k}\,\tau_1$, which means that $p_{i,i+k}$
computation fronts should pass through $a_{i,i+k}$. In Figure 7, we show the computation fronts
for the case in which each input $a_{i,i+k}$ is a $2 \times 2$ submatrix, and hence $0 \le p_{i,i+k} \le 4$. Nonzero
elements are simply denoted by an x.

The time for communication phases may be computed as discussed earlier, except that,
now, the time $\tau_0$ required for steps 2 and 5 is the time for the transmission of $m \times m$
matrices and m-dimensional vectors.

It was noticed earlier that the separation between any two fronts $CF_{t-1}$ and $CF_t$ is an
indication of the communication activities during the t-th global cycle. However, as seen in
Figure 7, only a small fraction of the cells are involved in communication during any specific
global cycle, due to the slow propagation of fronts. Hence, the separation between
communication activities and computation activities does, in this case, result in an estimate of the
execution time which is rather pessimistic.




Figure  7.  Partitioned  Matrix/Vector  multiplication 


A less pessimistic bound on the execution time may be obtained if we assume that $\tau_0$ is a
multiple $\sigma \tau_1$ of the unit time $\tau_1$ and redefine a global cycle to be a span $\tau_1$ of time during
which any particular cell may process data (step 3 of ALG1), communicate data (steps 2 and
5) or sit idle (steps 1 and 4). In this case, the number of fronts passing through any particular
element $a_{i,i+k}$ is augmented to $p_{i,i+k} + \sigma$, and the execution time of the network is given by
$N \tau_1$, where N is the total number of fronts. We will not discuss this approach here any
further.
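The number of fronts passing through a submatrix element under the two schemes of this section can be sketched as a nonzero count. The function name `fronts_through` and the representation of a block as a list of rows are assumptions for illustration; `sigma=0` corresponds to the scheme in which communication is kept in separate phases.

```python
def fronts_through(block, sigma=0):
    """Number of fronts through one block: p nonzeros, plus sigma unit-time
    steps when communication is counted in units of tau_1."""
    p = sum(1 for row in block for value in row if value != 0)
    return p + sigma


separate_phases = fronts_through([[1.5, 0.0], [0.0, 2.0]])          # p = 2
unit_time_cycles = fronts_through([[1.0, 1.0], [1.0, 1.0]], sigma=2)  # p = 4, sigma = 2
```

For $2 \times 2$ blocks the first count ranges from 0 to 4, matching the range of $p_{i,i+k}$ stated above.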


CONCLUSION 


We  presented  a  methodology  for  the  estimation  of  the  efficiency  of  data  driven  networks 
with  data  dependent  operations.  It  is  based  on  the  irregular  propagation  of  computation  fronts 
and  it  takes  into  account  the  effect  of  internal  data  conflict  without  any  assumption  about  the 
architecture  of  the  network  or  the  technology  used  to  implement  it.  The  main  objective  of  the 
paper  was  to  illustrate  the  methodology  and  the  different  related  concepts.  Its  application  to 
the  performance  study  of  specific  networks  is  presented  in  Melhem  (1986a)  and  (1986b). 